In this Take-home Exercise, I will explore to reveal the patterns of community interactions of the city of Engagement (Ohio US) by using social network analysis approach.
Challenge 1: Demographics and Relationships involves understanding the city’s demographics. Given social networks and other information about the city, you will analyze the available data to prepare a one-page fact sheet about the city’s demographics, its neighborhoods, and its business base.
In Challenge 1, you will use visual analytic techniques to address the following questions:
In this take-home exercise, social network approach is used to reveal the patterns of community interactions of the city of Engagement, Ohio USA while addressing the questions stated in the Task section.
The data are processed by using appropriate tidyverse family of packages.
The picture below shows a sketch of the initial design proposed.
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
The chunk code below will do the trick.
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock',
'tidyverse', 'ggmap',
'plotly', 'DT', 'patchwork','crosstalk')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports SocialNetwork.csv and
Participants.csv from the rawdata folder into R by using read_csv()
of readr
package and save them as tibble data frames called social and
participants.
social <- read_csv("rawdata/SocialNetwork.csv")
participants <- read_csv("rawdata/Participants.csv")
glimpse(social)
We will add 1 to each Participant Id in social and participants as vertex id is not allowed to be zero.
participants$participantId = participants$participantId + 1
social$participantIdFrom = social$participantIdFrom + 1
social$participantIdTo = social$participantIdTo + 1
We will re-order the fields in social data frame such that participantIdFrom & participantIdTo will become the first two columns. Weekday and year_month columns are also added to this data frame.
social <- social %>%
select(participantIdFrom, participantIdTo, timestamp) %>%
mutate(weekday = wday(timestamp,
label = TRUE,
abbr = FALSE))
For the ease of visualization, we will aggregate by participantIdFrom, participantIdTo and Weekday using the code chunk below. Weight is set as greater than 55 to ensure a simpler network graph
Data frames social, social_aggregated and participants are saved in RDS format to avoid uploading large files to Git.
The code chunk below helps to ensure the participants_selected contains all distinct Participant Ids from the social_aggregated data frame.
We will also create a age grouping further analysis. As for
educationLevel, we need to implement an alternative way of sorting since
as_tbl_graph() will turn the usual factor way of sorting
back to characters data type.
participants_selected$age_group <- cut(participants_selected$age,
breaks = c(-Inf,21, 31, 41, 51, Inf),
labels = c("<20", "21-30","31-40",
"41-50","51-60"),
right = FALSE)
participants_selected$educationLevel <- sub('Low', 'a-Low',participants_selected$educationLevel)
participants_selected$educationLevel <- sub('HighSchoolOrCollege', 'b-HighSchoolOrCollege',participants_selected$educationLevel)
participants_selected$educationLevel <- sub('Bachelors', 'c-Bachelors',participants_selected$educationLevel)
participants_selected$educationLevel <- sub('Graduate', 'd-Graduate',participants_selected$educationLevel)
In the code chunk below, ggplotly() of
plotly and bscols() of
crosstalk are used to plot an interactive bar chart
with a table to visualize the number of participants in network analysis
by age group.
dd <- highlight_key(participants_selected)
graph1 <- ggplot(dd, aes(x = age_group)) +
geom_bar(fill = '#468499') +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey')) +
ggtitle("Distribution of Participant Count Across Age Group") +
xlab("Age Group") +
ylab("Count of Par")
gg <- highlight(ggplotly(graph1),
"plotly_selected")
crosstalk::bscols(gg,
widths = c(12,12),
DT::datatable(dd,
rownames = FALSE,
colnames = c('Participant Id' = 'participantId',
'Household Size' = 'householdSize',
'Have Kids' = 'haveKids',
'Age' = 'age',
'Education Level' = 'educationLevel',
'Interest Group' = 'interestGroup',
'Joviality' = 'joviality',
'Age Group' = 'age_group'),
filter = 'top',
class = 'display'))
In the code chunk below, tbl_graph() of
tinygraph package is used to build a tidygraph’s
network graph data.frame.
social_graph <- graph_from_data_frame(social_aggregated,
vertices = participants_selected) %>%
as_tbl_graph()
social_graph
# A tbl_graph: 414 nodes and 6098 edges
#
# A directed multigraph with 105 components
#
# Node Data: 414 x 8 (active)
name householdSize haveKids age educationLevel interestGroup
<chr> <dbl> <lgl> <dbl> <chr> <chr>
1 5 3 TRUE 43 c-Bachelors H
2 9 3 TRUE 20 c-Bachelors G
3 11 3 TRUE 48 b-HighSchoolO~ D
4 13 3 TRUE 34 b-HighSchoolO~ D
5 15 3 TRUE 38 d-Graduate H
6 19 3 TRUE 20 d-Graduate I
# ... with 408 more rows, and 2 more variables: joviality <dbl>,
# age_group <chr>
#
# Edge Data: 6,098 x 4
from to weekday weight
<int> <int> <chr> <int>
1 1 119 Sunday 57
2 1 119 Monday 57
3 1 119 Tuesday 57
# ... with 6,095 more rows
The code chunk below uses ggraph(),
geom-edge_link() and geom_node_point() to plot
a network graph by using social_graph.
ggraph(social_graph) +
geom_edge_link(aes(colour = 'red')) +
geom_node_point() +
labs(title = "Network Graph in City of Engagement") +
theme_graph()
We can see that this give rises to a series of network plots where we can find relatively simple network such as two persons and relatively complex network such as the one on the upper right corner. Our next task is to extract out the the most, moderate and least complex networks and to investigate them in greater detail.
The 3 networks varied in complexity are identified as follows:
ggraph(social_graph) +
geom_edge_link(aes(width=weight),
alpha=0.05) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = age_group,
size = 2)) +
labs(title = "Network Graph by Age Group") +
theme_graph()
ggraph(social_graph) +
geom_edge_link(aes(width=weight),
alpha=0.05) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = interestGroup,
size = 2)) +
labs(title = "Network Graph by Interest Group") +
theme_graph()
ggraph(social_graph) +
geom_edge_link(aes(width=weight),
alpha=0.05) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = as.factor(householdSize),
size = 2)) +
labs(title = "Network Graph by Household Size") +
theme_graph()
ggraph(social_graph) +
geom_edge_link(aes(width=weight),
alpha=0.05) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = educationLevel,
size = 2)) +
labs(title = "Network Graph by Education Level") +
theme_graph()
The possible explanations for the most, moderate and least complex networks identified are:
This network seems to represent a company with possibly 5 departments (represented by 5 inner networks with greater weight). Most departments have a good spread of the age distribution with some youths and senior leaders. Within each inner networks, they have very different interest groups, which might be true in a company set-up. There is also at least 1 Bachelor/Graduate in each inner network, which may also coincides with the company structure.
This network seems to represent a group of friends which predominantly are made up of High School/College education levels and have similar interest in interest group D & G. This group seems to also have 2 inner networks. The network on the left represent those who are more junior in age and are mostly single. The network on the right represent those who are more senior and are having bigger household size.
This network seems to represent 2 siblings who have the same interest and household size. Their difference in education level can be explained by the difference in age.
In the code chunk below, facet_nodes() of
ggraph is used to plot a framed facet network
graph.
set_graph_style()
ggraph(social_graph) +
geom_edge_link(aes(width=weight),
alpha=0.05) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = educationLevel),
size = 2) +
facet_nodes(~educationLevel)+
th_foreground(foreground = "grey80",
border = TRUE) +
labs(title = "Facet Network Graph by Education Level") +
theme(legend.position = 'bottom')
We can see that there are a greater level of connection among participants of HighSchoolOrCollege / Bachelor levels. There is limited connection among participants of Graduate level while there is almost no connection among those with Low education level.
In the code chunk below, some data wrangling are used to prepare the
data for visNetwork, and visOptions() is
used to incorporate interactivity features in the network graph.
social_edges_aggregated <- social %>%
left_join(participants_selected, by = c("participantIdFrom" = "participantId")) %>%
rename(from = participantIdFrom) %>%
left_join(participants_selected, by = c("participantIdTo" = "participantId")) %>%
rename(to = participantIdTo) %>%
group_by(from, to) %>%
summarise(weight = n()) %>%
filter(from!=to) %>%
filter(weight > 50) %>%
ungroup()
participants_selected_int <- participants_selected_int %>%
rename(group = educationLevel, id = participantId)
visNetwork(participants_selected_int,
social_edges_aggregated) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE,
nodesIdSelection = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)